To find patterns and outliers in CSVs and event data, Graphistry provides the hypergraph transform.
As an example, this notebook examines different malware files reported to a security vendor. It reveals phenomena such as:
In [2]:
import pandas as pd
import graphistry as g
# g.register(key='MY_API_KEY', server='labs.graphistry.com')  # request a key at https://www.graphistry.com/api-request
In [7]:
df = pd.read_csv('./barncat.1k.csv', encoding = "utf8")
print("# samples", len(df))
import json
json.loads(df['value'].iloc[0])  # parse the first 'value' entry; avoids eval() on file contents
In [8]:
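# Keep only rows whose 'value' holds a JSON object, to avoid double counting
# regex=False matches a literal "{"; na=False treats missing values as non-matches
df3 = df[df['value'].str.contains("{", regex=False, na=False)]
df3[:1]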
In [9]:
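# Unpack the 'value' JSON into one column per key
import json
# axis=1 places the unpacked fields beside the remaining columns;
# the default axis=0 would stack them underneath as extra rows
df4 = pd.concat([df3.drop('value', axis=1), df3.value.apply(json.loads).apply(pd.Series)], axis=1)
print("# rows", len(df4))
df4[:1]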
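To make the unpacking step concrete, here is a minimal sketch on a two-row toy table. The toy data and the names `toy`, `parsed`, and `unpacked` are made up for illustration; only the `json.loads` plus column-wise `pd.concat` pattern mirrors the cell above.

```python
import json
import pandas as pd

# Toy stand-in for the real CSV: 'value' holds a JSON object as text.
toy = pd.DataFrame({
    "event_id": [1, 2],
    "value": ['{"md5": "aa", "Port": 80}', '{"md5": "bb", "Port": 443}'],
})

# Parse each JSON string into a dict, then build one column per key.
# index=toy.index keeps the parsed rows aligned with their source rows.
parsed = pd.DataFrame(toy["value"].map(json.loads).tolist(), index=toy.index)

# axis=1 joins side by side; the default axis=0 would stack rows instead.
unpacked = pd.concat([toy.drop(columns="value"), parsed], axis=1)
print(unpacked)
```

Each input row stays a single row, now with `md5` and `Port` as ordinary columns next to `event_id`.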
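The hypergraph transform creates:
A node for every row
A node for every unique value in a column
An edge connecting each row to its values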
When multiple rows share a value, they cluster around that value's node. When a row has unique values, those form a ring around that row's node only.
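The clustering behavior can be sketched without Graphistry at all. The function below is a hypothetical, dependency-free approximation, not the library's implementation: `toy_hypergraph`, the `row::` prefix, and the `column::value` id format are assumptions chosen for this sketch, and rows are plain dicts rather than a DataFrame.

```python
def toy_hypergraph(rows):
    """Return (nodes, edges) for a list of dict rows.

    Illustrative only: the node id formats are made up for this sketch.
    """
    nodes = set()
    edges = []
    for i, row in enumerate(rows):
        row_node = f"row::{i}"  # one node per row
        nodes.add(row_node)
        for column, value in row.items():
            if value is None:  # missing values produce no node or edge
                continue
            value_node = f"{column}::{value}"  # one node per unique value in a column
            nodes.add(value_node)
            edges.append((row_node, value_node))
    return nodes, edges


rows = [
    {"family": "rat-a", "port": 80},
    {"family": "rat-a", "port": 443},
    {"family": "rat-b", "port": 8080},
]
nodes, edges = toy_hypergraph(rows)

# Count how many rows touch each value node.
degree = {}
for _, value_node in edges:
    degree[value_node] = degree.get(value_node, 0) + 1
print(degree)
```

Rows 0 and 1 both link to `family::rat-a`, so a layout pulls them together. Every other value node has a single edge and hangs off its own row.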
In [12]:
g.hypergraph(df4)['graph'].plot()
We clean up the visualization in a few ways:
Categorize the hash columns as one family. This simplifies coloring by the generated 'category' field. When columns share values, such as two columns that both hold md5 values, it also creates one node per hash instead of one per column instance.
Skip attributes that are not useful as nodes, such as numbers and dates.
Running help(g.hypergraph) reveals more options.
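As a rough illustration of what categorizing and skipping do to the node set, consider the sketch below. It is an assumption-laden toy, not Graphistry's code: `value_nodes`, its parameters, the `prefix::value` id format, and the column names `dropped_md5` and `parent_md5` are all hypothetical.

```python
def value_nodes(row, categories=None, skip=()):
    """Return the value-node ids a single dict row would produce.

    categories: maps a category name to the columns that share it.
    skip: columns that should not become nodes.
    Illustrative only; the id format is an assumption of this sketch.
    """
    column_to_category = {
        column: category
        for category, columns in (categories or {}).items()
        for column in columns
    }
    ids = set()
    for column, value in row.items():
        if column in skip or value is None:
            continue
        # Columns in a category share a prefix, so equal values collapse.
        prefix = column_to_category.get(column, column)
        ids.add(f"{prefix}::{value}")
    return ids


# Hypothetical row: two columns hold the same md5, plus a numeric attribute.
row = {"dropped_md5": "abc", "parent_md5": "abc", "Port": 80}

plain = value_nodes(row)
grouped = value_nodes(
    row,
    categories={"hash": ["dropped_md5", "parent_md5"]},
    skip=["Port"],
)
print(sorted(plain))
print(sorted(grouped))
```

Without options, the row yields three value nodes. With the two md5 columns grouped under `hash` and `Port` skipped, it yields a single `hash::abc` node.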
In [11]:
g.hypergraph(
    df4,
    opts={
        'CATEGORIES': {
            'hash': ['sha1', 'sha256', 'md5'],
            'section': [x for x in df4.columns if 'section_' in x]
        },
        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']
    })['graph'].plot()
In [13]:
# direct=True links a row's values to each other directly, without a node for the row itself
g.hypergraph(
    df4,
    direct=True,
    opts={
        'CATEGORIES': {
            'hash': ['sha1', 'sha256', 'md5'],
            'section': [x for x in df4.columns if 'section_' in x]
        },
        'SKIP': ['event_id', 'InstallFlag', 'type', 'val', 'Date', 'date', 'Port', 'FTPPort', 'Origin', 'category', 'comment', 'to_ids']
    })['graph'].plot()